NL and Speech in the MULTEXT Project

Authors

  • Jean Véronis
  • Daniel Hirst
  • Robert Espesser
  • Nancy Ide
  • Robert Schuman
Abstract

MULTEXT is the largest project funded under the LRE (Linguistic Research and Engineering) Program. It is intended to contribute to the development of generally usable software tools to manipulate and analyse multi-lingual text and speech, and to annotate multi-lingual text and speech corpora with structural and linguistic markup. It will attempt to establish conventions for the encoding of such corpora, building on and contributing to the preliminary recommendations of the relevant international and European standardization initiatives. MULTEXT will also work towards establishing a set of guidelines for linguistic software development, which will be widely published in order to enable future development by others. The speech component of the project is intended to explore the possibilities of integrating NL and speech processing by attempting to harmonize tools and methods from both areas. MULTEXT will focus on phenomena at the intersection of the two domains, in particular prosody, whose supra-segmental nature invites research into the complex relationships it holds with morphology and syntax.

Similar resources

Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages

The EU Copernicus project Multext-East has created a multi-lingual corpus of text and speech data, covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons for each of the languages were developed. The corpus includes a parallel component consisting of Orwell’s Nineteen Eighty-Four, with versions in all six languages...

The MULTEXT-East corpus

The EU MULTEXT-East project has produced harmonised language resources for Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In this paper we introduce the MULTEXT-East multilingual corpus, which comprises marked-up texts in the six languages totaling approximately 2 million words and a small speech corpus. The corpus is encoded in SGML, in the TEI-like Corpus Encoding Specification...

A French Phonetic Lexicon with Variants for Speech and Language Processing

This paper reports on a project aiming at the semi-automatic development of a large orthographic-phonetic lexicon for French, based on the Multext dictionary. It details the various stages of the project, with an emphasis on the methodological and design aspects. Information regarding the lexicon’s content is also given, together with a description of interface tools which should facilitate its...

MULTEXT: Multilingual Text Tools and Corpora

MULTEXT (Multilingual Text Tools and Corpora) is the largest project funded in the Commission of European Communities Linguistic Research and Engineering Program. The project will contribute to the development of generally usable software tools to manipulate and analyse text corpora and to create multi-lingual text corpora with structural and linguistic markup. It will attempt to establish conv...

Persian in MULTEXT-East Framework

Farsi, also known as Persian, is the official language of Iran and Tajikistan and one of the two main languages spoken in Afghanistan. It is an Indo-European agglutinating language written in Arabic script. This paper presents the first step in creating a basic language resources kit for Farsi. This step comprises the specifications for morphosyntactic encoding, which is based on the EAGLES/MULTEXT mod...

Publication date: 1994